Abstract - Network coding-based storage has recently received a lot of attention in the network coding community. Independently,
another body of work has proposed integrity checking schemes for cloud storage, none of which, however, is customized for network 
coding storage or can efficiently support repair. In this work, we bridge the gap between these currently disconnected bodies of work, 
and we focus on the (novel) advantage of network coding for integrity checking. We propose NC-Audit, a remote data integrity checking
scheme designed specifically for network coding-based storage clouds. NC-Audit provides a unique combination of desired properties:
(i) efficient checking of data integrity, (ii) efficient support for repairing failed nodes, (iii) full support for modification of outsourced data, and
(iv) protection against information leakage when checking is performed by a third party. The key ingredient of the design of NC-Audit
is a novel combination of SpaceMac, a homomorphic MAC scheme for network coding, and NCrypt, a novel CPA-secure encryption 
scheme that is compatible with SpaceMac. Our evaluation of a Java implementation of NC-Audit shows that an audit costs the storage 
node and the auditor only a few milliseconds of computation time, and lower bandwidth than prior work. 



1 Introduction 

Fundamental to cloud computing is the ability to
store user data reliably on the storage cloud. If the
original data consists of K packets, an (N, K) maximum
distance separable (MDS) code is typically used to produce
N packets to be stored individually on N storage
nodes, thus tolerating up to (N - K) node failures.
Network coding (NC) has been shown to achieve the
minimum repair bandwidth - much less than the K packets
required to reconstruct the original data [1], [2].
The key ingredients of NC-based distributed storage
include (i) subpacketization, i.e., each storage node stores
subpackets (or blocks) that are linear combinations of the
blocks that form the original data, and (ii) subpacket
mixing when repairing. An example is given in Fig. 1.
However, repair bandwidth is only one aspect of cloud
storage.

Another practical aspect, which has received only 
modest attention in the network coding community, is 
integrity checking of the data stored on the cloud. Data 
can be lost or corrupted for various reasons without the 
user being aware of it. For example, storage errors, such
as torn writes [3] and latent errors [4], may damage the
data in a way that is not detected. Data storage providers
also have incentives to cheat: e.g., some providers do
not report data loss incidents in order to maintain their
reputation [5]-[7]. This problem is further exacerbated
in NC-based systems because corrupted data on one 
storage node can propagate to many other nodes during 
the repair process. Therefore, it is important for the user 
to be able to audit the integrity of the data stored on the 
cloud. 

Fig. 1. Repairing a failed node [1]: The original data
consists of four blocks: b1, b2, b3, and b4. A (4,2) MDS
code is used such that any 2 nodes can be used to restore
the original data. Note that the repair involves combining
blocks b3 and b4, and the repair bandwidth consists of 3
blocks instead of the 4 needed to reconstruct the
whole data.

However, considering a large file stored on the cloud,
the ability to audit this file regularly may be out of
the ability or budget of users with limited resources
[7], [8]. Therefore, users often resort to a third party
to perform the audit on their behalf [5], [7], [9], [10].
In this latter case, it is important that the auditing
protocol be privacy-preserving, i.e., it should not leak
the data to the third party [7], [11]. Indeed, the users
can leverage data encryption to protect their data before
outsourcing it [10]. However, data encryption should
be complementary and orthogonal to integrity checking 
protocols. Furthermore, the users may want to outsource 
unencrypted instead of encrypted data to support more 
efficient and/or complex computation over the data. 

Although there is a rich literature on auditing protocols
for cloud storage in general [5]-[7], [9]-[17], there
have been very few auditing protocols for NC-based
distributed storage systems [18], [19]. These protocols,
however, are generic in the sense that they do not
specifically exploit network coding properties for efficient
integrity checking [18]. Furthermore, they do not
efficiently support repair or data dynamics [18], and do
not prevent data leakage [18], [19].

In this work, we propose a symmetric key-based 
cryptographic protocol, called NC-Audit, to check for 
the integrity of data stored on an NC-based distributed
storage system. To the best of our knowledge, this is 
the first scheme proposed for NC-based systems that 
possesses all the following properties: 

(i) Efficient Integrity Checking: The integrity check 
incurs a small bandwidth and computational over- 
head (a few milliseconds). It guarantees that, with
high probability, the storage provider passes the
integrity check if and only if it possesses the data. The
proposed protocol also supports an unlimited number
of checks.

(ii) Efficient Support for Repair and Data Dynamics: 
The repair of failed nodes and the changes made 
to the data (including update, append, insert, and 
delete operations) require negligible bandwidth (no
data download) and computation (sub-millisecond)
for maintaining the metadata used by the
integrity checking.

(iii) Efficient Privacy Protection: A third party audi- 
tor cannot learn any information about the user 
data through the checking protocol, except for the 
metadata used by the integrity checking. This
privacy-preserving property incurs a small bandwidth
overhead (0.4%) and a small computational overhead
(a few milliseconds).

We would like to emphasize that, independently of 
(iii), properties (i) and (ii) together are already useful 
and of interest to users who prefer to audit the data 
themselves; furthermore, NC-Audit is also the first pro- 
tocol that possesses (i) and (ii) at the same time. In 
addition, NC-Audit is the first auditing scheme that fully 
exploits network coding by design. The key ingredient of
NC-Audit is a novel combination of SpaceMac, a homomorphic
authenticator that was previously designed
specifically for network coding [20], [21], and NCrypt,
a novel encryption scheme that exploits random linear
combinations so as to be compatible with SpaceMac
(Section 4).

We implemented NC-Audit in Java, utilizing our previous
implementation of SpaceMac [21]. Our evaluation
of NC-Audit shows that it has very low computational
overhead: when performing an audit, both the storage
node and the third party auditor need to spend only a
couple of milliseconds.

The rest of the paper is organized as follows. In
Section 2, we discuss related work. In Section 3, we
formulate the problem and describe the threat model. In
Section 4, we describe the auditing framework and the
key building blocks of NC-Audit (SpaceMac and NCrypt)
before presenting NC-Audit itself. In Section 5, we show
how NC-Audit efficiently supports repair and data dynamics.
In Section 6, we analyze the security of NC-Audit.
In Section 7, we evaluate its storage, bandwidth, and
computational efficiency. In Section 8, we conclude.



2 Related Work 

2.1 Integrity Checking for Remote Data 

There has been a rich body of work on integrity checking
for remote data [5]-[7], [9]-[17], known as Proof of
Retrievability and Proof of Data Possession.



Proof of Retrievability (POR). In [10], Juels and
Kaliski introduced the notion of POR, where a POR
enables a client (verifier) to determine that the server 
(prover) possesses a file or data object. Furthermore, a 
successful execution of POR would allow a verifier to 
extract the file from the proof. The main POR scheme 
presented in this work uses sentinels, i.e., small check 
blocks, that are inserted into the outsourced data to 
guard against large file corruption. At the same time, 
it also utilizes error correcting codes to protect against 
small file corruption. This scheme can only handle a 
limited number of queries, which has to be fixed a priori. 
NC-Audit does not use sentinels and supports an unlimited
number of queries.

In [9], Shacham and Waters proposed two POR
schemes with full proofs of security and extractability.
The first one, built on BLS signatures, provides public
verifiability. The second one, built on pseudorandom
functions (PRFs), provides private verifiability. Both of
these schemes exploit homomorphic properties to aggre- 
gate authenticator values. NC-Audit also exploits homo- 
morphic properties and provides private verifiability. 

Proof of Data Possession (PDP). The notion of PDP 
was introduced by Ateniese et al. [5]. The PDP scheme
in [5] uses homomorphic RSA signatures to generate
verification tags. The data possession guarantee provided
by this scheme holds under the RSA and KEA1 [22]
assumptions in the random oracle model. As discussed
in [9], the notion of PDP is considered weaker than POR.
This is because in POR, a successful audit guarantees
that all the data can be extracted, while in PDP, only a
certain percentage of the data (e.g., 90%) is guaranteed
to be available. We will show that NC-Audit provides the
stronger data possession guarantee, as in POR (Section 6.1).

Data Dynamics. In [12], Ateniese et al. proposed a
symmetric-key based checking scheme that supports
data dynamics. This scheme is built on regular PRFs,
hash functions, and encryptions. It provides private
verifiability and only supports a limited number of queries.
In [14], Erway et al. proposed an auditing scheme that is built
on rank-based authenticated skip lists and requires the
storage server to maintain the lists for verification. It
provides private verifiability but could be extended to
provide public verifiability [14]. In [15], Wang et al. proposed
a public auditing scheme that uses a combination
of the BLS-based scheme in [9] and a Merkle Hash Tree
(MHT). NC-Audit supports data dynamics and does not
require data block download (blockless) in all operations
(Section 5.2). The approach taken by NC-Audit is similar
to [12] but different from [15] and [14], where the
changes are immediately verified by the user.



Privacy Preserving. In [6], Shah et al. proposed an
auditing protocol that is privacy preserving. This protocol
first encrypts the data and then sends a number
of message authentication code (MAC) tags of the encrypted
data to the auditor. The auditor verifies both the
outsourced data and the outsourced encryption key. This
approach only works on encrypted files. It also requires
the auditor to maintain states and supports only a limited
number of audits. In [11], Wang et al. also proposed
a privacy-preserving auditing protocol that has public
verifiability. This protocol can be considered an extension
of the BLS-based protocol in [9]. In this approach, the
aggregated (proving) block sent by the storage server is
masked with a random element to protect the privacy
of the block. NC-Audit is explicitly designed to provide
privacy-preserving auditing (Sections 4.5 and 6.2). In
particular, NC-Audit provides privacy by encrypting the
response block.

We stress that none of the schemes described above 
was customized for NC-based storage; thus, they do not 
provide efficient support for node repair. NC-Audit was 
designed to achieve all the above good properties while 
efficiently supporting repair. 

2.2 Integrity Checking for NC-based Storage Systems

NC-based Storage Systems. The benefits of network 
coding for distributed storage were first formalized by 
the work of Dimakis et al. [2]. In particular, in [2],
the authors proposed the notion of regenerating codes
and showed that they can significantly reduce the repair
bandwidth. This work showed the fundamental tradeoff
between node storage and repair bandwidth and proposed
regenerating codes that can achieve any point on
the optimal tradeoff curve. An excellent survey on recent
advances in NC-based storage systems can be found in
[1]. A wiki on NC-based storage clouds is maintained at
[23]. NC-Audit is designed to fully support regenerating
codes.

One of the first implementations of an NC-based storage
cloud is NCCloud by Hu et al. [24]. In particular,
NCCloud is a proxy-based system for multiple-cloud
storage. It utilizes a functional minimum-storage
regenerating code to provide cost-effective repair for a
permanent single-cloud failure. This efficient repair is
achieved without sacrificing storage cost or redundancy level.
The NCCloud prototype was deployed atop Windows Azure
Storage.

Integrity Checking Schemes. There have been only a
few works that provide remote data checking for NC-based
storage. In [19], Dikaliotis et al. proposed an integrity
checking scheme that utilizes the error-correction
capabilities of the storage system. This scheme aims to
detect errors with a very small amount of bandwidth.
The key technique for reducing the bandwidth is to
project data blocks onto a small random vector. This
checking scheme is inherently different from NC-Audit as
it relies on the communication between the auditor and
multiple nodes to perform a single check while NC-Audit
does not. Moreover, this scheme is information-theory
based while NC-Audit leverages cryptographic primitives
to provide the checking.

A more recent integrity checking scheme for NC-based
storage was proposed in [18]. In this work, Chen
et al. adopted the symmetric-key based scheme that
Shacham and Waters proposed for regular cloud storage
[9] with minor modification. In particular, atop the
symmetric-key based scheme in [9], the scheme in [18]
proposed to encrypt the coding coefficients of the outsourced
encoded blocks to prevent replay attacks, where a
malicious storage node may store old (incorrect) encoded
blocks instead of the new (correct) encoded blocks as
required by the repair [18]. NC-Audit overcomes this
attack by requiring the user/auditor to store the coding
coefficients, which are needed for the repair process,
occupy only a small amount of storage, and could be
made constant in size (Section 7.1).

The scheme in [18] also protects the repair phase
against pollution attacks, i.e., preventing remaining
nodes from sending corrupted data to the new (recovering)
node. In [18], the authors proposed that the user
acts as the middle man, i.e., downloads necessary blocks
for repair, checks for their integrity, and constructs and
sends the new blocks to the new node. This approach
puts a heavy bandwidth and computational overhead
on the user. Dealing with pollution attacks is out of
the scope of this work. We refer the reader to the
rich literature, including our previous work, that deals
with pollution attacks [25]-[31]. We stress that when
leveraging other pollution defense approaches [25]-[31],
the new node can detect a pollution attack itself without
resorting to the user acting as the middle man, thereby
eliminating both the user's bandwidth and computational
overhead.

Finally, the scheme in [18] supports neither data dynamics
nor privacy-preserving auditing, while NC-Audit
does. We provide a detailed performance comparison between
NC-Audit and [18] in Section 7.

Other Security Issues. Other security problems for
NC-based storage include protecting the privacy and
integrity of the blocks while repairing. On one hand, the
works in [32] and [33] prevent eavesdroppers from accessing/decoding
all the data. In [32], Pawar et al. provide
an explicit code construction that achieves the secrecy
capacity for the bandwidth-limited regime of the storage
systems under repair dynamics. In [33], Rouayheb et
al. analyze the effects of interaction between the storage
nodes on the amount of data revealed to the eavesdroppers.
On the other hand, the works in [34] and [35] aim to
provide protection against pollution attacks during the
repair. In [34], Pawar et al. provide upper bounds on
the maximum amount of information that can be stored
safely when there are malicious nodes. In [35], Buttyan
et al. provide a lightweight, pollution-resilient decoding
algorithm that is capable of finding adversarial blocks.

Fig. 2. Parties and Steps Involved in NC-Audit.

2.3 Our Work in Perspective 

The preliminary version of this work appeared in
[36]. This work is an important extension of that
6-page version. In particular, this work provides
complete proofs of all theorems. It also provides a complete
discussion of all data update operations, including
block update, append, insert, and delete. Furthermore,
we discuss and compare our storage overhead with the
prior work [11], [15], [18]. Finally, this work includes
a comprehensive discussion of the related literature.

3 Problem Formulation

3.1 System Model and Operations

Fig. 2 illustrates an overview of the system. We consider
a cloud storage service that involves three entities: a 
user, NC-based storage nodes, which make up the stor- 
age cloud, and a third party auditor (TPA). The user 
distributes her data on the storage nodes and may also 
dynamically update her data. The user resorts to a TPA 
to check for the integrity of her data stored at each node; 
at the same time, she does not want the TPA to learn 
about her data. We assume that the user is responsible 
for maintaining the data stored at each storage node. Our 
work, however, is also applicable to the scenario where 
there is a cloud service provider who is independent 
from the user and is responsible for maintaining the 
storage cloud. 

The user follows these basic steps to store her
data on the storage cloud. We adopt the notation used in
[20]. Denote the original file by $\mathcal{F}$. The user first divides
$\mathcal{F}$ into $m$ blocks, $\hat{b}_1, \dots, \hat{b}_m$. Each block is a vector in an
$n$-dimensional linear space $\mathbb{F}_q^n$, where $\mathbb{F}_q$ is a finite field of
size $q$. To facilitate decoding, the user then augments
each block $\hat{b}_i$ with its $m$ global coding coefficients. The
resulting blocks, $b_i$, have the following form:

$$b_i = (\hat{b}_i, \underbrace{0, \dots, 0, 1}_{i}, 0, \dots, 0) \in \mathbb{F}_q^{n+m}$$

We call the $b_i$ source blocks and the space spanned by them the
source space, denoted by $\Pi$. We use aug($b_i$) to denote the
coefficients of $b_i$. Typically, $n \gg m$, and this representation
is also called an n-extended version of a storage code

HI- 

The user then creates a number of encoded blocks
using an appropriate linear coding scheme for the desired
reliability, e.g., an array MDS EVENODD code is used
in Fig. 1. Each encoded block is a linear combination
of the source blocks. Note that if an encoded block $e$
equals $\sum_{i=1}^{m} \alpha_i b_i$, then the last $m$ coordinates of $e$ are
exactly the coding coefficients $\alpha_i$. These encoded blocks
are then distributed across the $N$ storage nodes of the
storage cloud. Let $M$ be the number of encoded blocks
stored at a storage node. In the example given in Fig. 1,
$m = 4$, $N = 4$, and $M = 2$.
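
The augmentation and encoding steps above can be illustrated with a minimal sketch. The field size `Q`, the block length, and the helper names `augment` and `encode` are assumptions made for illustration only; they are not taken from the NC-Audit implementation, and a real system would use the coefficients dictated by its storage code rather than an arbitrary vector.

```python
import random

Q = 257  # small prime field size, assumed only for this illustration


def augment(raw_blocks):
    """Append the i-th unit vector of length m to the i-th raw block."""
    m = len(raw_blocks)
    return [list(block) + [1 if j == i else 0 for j in range(m)]
            for i, block in enumerate(raw_blocks)]


def encode(source_blocks, coeffs):
    """Return sum_i coeffs[i] * source_blocks[i], coordinate-wise over F_Q."""
    length = len(source_blocks[0])
    return [sum(a * b[j] for a, b in zip(coeffs, source_blocks)) % Q
            for j in range(length)]


# m = 4 raw blocks of n = 6 symbols each (hypothetical sizes)
raw = [[random.randrange(Q) for _ in range(6)] for _ in range(4)]
source = augment(raw)
alpha = [3, 0, 5, 1]          # hypothetical coding coefficients
e = encode(source, alpha)
print(e[-4:])                 # the last m coordinates equal alpha
```

Because each source block carries its own unit vector, the coefficients of any encoded block can be read directly from its last m coordinates, which is what makes decoding and later tag combination straightforward.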

3.2 Threat Model

We adopt the threat model considered in [11] and [16]. In
particular, we consider semi-trusted storage nodes who
behave properly and do not deviate from the prescribed
protocol. However, for their own benefit, the nodes may
deliberately delete rarely accessed user data. They may
also decide to hide data corruptions, caused by either 
internal or external factors, to maintain reputation. For 
clarity, we focus our discussion on a single storage node 
except when discussing the repair process. 

Similar to [11], we assume that the TPA, who is in
the business of auditing, is reliable and independent.
The TPA has no incentive to collude with the user or
the storage node during the auditing process. The TPA,
however, must not be able to learn any information
about the user's data through the auditing process, aside
from the metadata needed for the auditing.

In summary, the threat model includes a malicious 
storage node, who wants to hide data corruption, and 
a TPA, who wants to learn about the user's data. We 
assume that both the node and the TPA are fully aware of
all the cryptographic constructions and protocols used;
however, their runtime is polynomial in the security
parameter.

4 Auditing Scheme 

4.1 Definitions and Auditing Framework 

We follow the literature on checking the integrity of
remote data [5], [9]-[11], [13] and adapt the proposed
framework to our privacy-preserving auditing system.
In particular, we consider an auditing scheme which 
consists of four algorithms: 

• KeyGen($1^{\lambda}$) $\to$ ($k_1$, $k_2$) is a probabilistic key generation
algorithm that is run by the user to set up
the scheme. It takes a security parameter $\lambda$ as input
and outputs two different private keys, $k_1$ and $k_2$.
$k_1$ is used to generate verification metadata, and $k_2$
is used to encrypt the possession proof.

• TagGen($e$, $k_1$) $\to t$ is a probabilistic algorithm run
by the user to generate the verification metadata. It
takes as input a coded block, $e$, and a private key, $k_1$,
and outputs the verification metadata of $e$, $t$.

• GenProof($k_2$, ($e_1, \dots, e_M$), ($t_{e_1}, \dots, t_{e_M}$), chal) $\to V$
is run by the storage node to generate a proof of
possession. It takes as input a secret key, $k_2$, the coded
blocks stored at the node, $e_1, \dots, e_M$, their corresponding
verification metadata, $t_{e_1}, \dots, t_{e_M}$, and a
challenge, chal. It outputs a proof of possession, $V$,
for the coded blocks determined by chal.

• VerifyProof($k_1$, chal, $V$) $\to$ {1, 0} is run by the user
in order to validate a proof of possession. It takes
as inputs a secret key, $k_1$, a challenge, chal, and
a proof of possession, $V$. It returns 1 (success) if
$V$ is the correct proof of possession for the blocks
determined by chal and 0 (failure) otherwise.

An auditing system can be constructed from the above 
algorithms and consists of two phases: 

• Setup: The user initializes the security parameters
of the system by running KeyGen. The encoded
blocks are prepared as described in Section 3.1.
The user then runs TagGen to generate verification
metadata for each encoded block. Afterwards, both
the encoded blocks and verification metadata are
uploaded to the storage node. The encoded blocks
are then deleted from the user's local storage. Finally,
the user sends the metadata needed to perform
the audit to the TPA.

• Audit: The TPA issues an audit message, i.e., a
chal, to the storage node to make sure that the
node correctly stores its assigned coded blocks. The
node generates a proof of possession for the blocks
specified in chal by running GenProof, and it sends
the possession proof back to the TPA. Finally, the
TPA runs VerifyProof to verify the possession proof
it receives.

4.2 Basic Scheme and Key Techniques 

Here we describe the most basic scheme that supports 
remote data checking and show that it does not provide 
the desired properties. This basic scheme is also described
in [5]. Afterwards, we describe how we improve
this basic scheme to arrive at our proposed scheme.

The Basic Scheme. During the Setup phase, the user
precomputes a message authentication code (MAC) tag,
$t_i$, for each coded block, $e_i$, using a secret key, $k_1$, and
a standard MAC scheme, e.g., HMAC. She uploads both
the tags and the coded blocks to the storage node and
sends $k_1$ to the TPA. During the Audit phase, to verify
that the node stores $e_i$ correctly, the TPA issues a request
for $e_i$. The node then sends $e_i$ and its tag $t_i$ to the TPA.
The TPA can use $k_1$ and $t_i$ to check the integrity
of $e_i$. Although it provides possession checking, this
scheme suffers from many drawbacks:

• It is inefficient in both computation and communica- 
tion, i.e., the computation and bandwidth overhead 
increases linearly in the number of checked blocks. 



• It does not efficiently support repair [1], [2]: it requires
the user to download all the coded blocks
to be stored at the new (recovering) node and then
compute the verification tag for each block,
essentially setting up the storage node again.

• It violates privacy as the TPA learns about the 
blocks. Note that the straightforward way to pro- 
vide privacy is to encrypt the response block using 
a standard encryption scheme, e.g., AES. However, 
in this way, the TPA will not be able to verify the 
integrity of the original block from the provided 
encrypted block. 
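
The basic scheme can be sketched in a few lines, assuming HMAC-SHA256 from the Python standard library as the "standard MAC scheme". The function names `tag_block` and `audit_block` and the block sizes are hypothetical and chosen only to make the drawbacks above concrete.

```python
import hashlib
import hmac
import os


def tag_block(key: bytes, block: bytes) -> bytes:
    """User side (Setup): one HMAC tag per coded block."""
    return hmac.new(key, block, hashlib.sha256).digest()


def audit_block(key: bytes, block: bytes, tag: bytes) -> bool:
    """TPA side (Audit): recompute the tag over the full block and compare."""
    return hmac.compare_digest(tag_block(key, block), tag)


k1 = os.urandom(32)                             # shared by the user and the TPA
blocks = [os.urandom(64) for _ in range(4)]     # stand-ins for coded blocks
tags = [tag_block(k1, b) for b in blocks]       # uploaded alongside the blocks

# To check block 2, the node must ship the whole block and its tag to the TPA.
ok = audit_block(k1, blocks[2], tags[2])

# A single flipped bit is detected.
corrupted = bytes([blocks[2][0] ^ 1]) + blocks[2][1:]
bad = audit_block(k1, corrupted, tags[2])
print(ok, bad)
```

The sketch makes the listed drawbacks visible: every audited block travels to the TPA in the clear, and the cost of checking several blocks grows with the number of blocks because the tags cannot be combined.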

Key Techniques. We improve the basic scheme to arrive 
at our proposed scheme by leveraging (i) a homomor- 
phic MAC scheme and (ii) a customized encryption 
scheme that exploits properties of linear network coding. 

In particular, we adopt SpaceMac, a homomorphic
MAC scheme that we previously designed specifically
for network coding [20], [21]. We use SpaceMac to generate
verification tags. With SpaceMac, the integrity of
multiple blocks can be verified with the computation and
communication cost of a single block verification, thanks
to the ability to combine blocks and tags. SpaceMac also
facilitates repair as verification metadata at the newly
constructed node can be computed efficiently from existing
metadata at healthy nodes.

We custom design a novel encryption scheme, called
NCrypt, to protect the privacy of the response blocks.
NCrypt is constructed in a way that a response block,
even when encrypted, can be used by the TPA for the
integrity check. NCrypt employs the random linear combination
technique of network coding to be compatible
with SpaceMac verification. NCrypt is semantically secure
under a chosen plaintext attack (CPA-secure). Next, we
briefly describe how we use SpaceMac and describe
NCrypt in detail.

4.3 The Homomorphic MAC: SpaceMac

In prior work, we designed SpaceMac and used it to combat
pollution attacks in network coding [20], [21], [30],
[31]. Here, we use SpaceMac to support the aggregation
of file blocks and tags. SpaceMac consists of a triplet
of algorithms: Mac, Combine, and Verify. The construction
of SpaceMac uses a pseudo-random function (PRF)
$F_1 : \mathcal{K}_1 \times (\mathcal{I} \times [1, n+m]) \to \mathbb{F}_q$, where $\mathcal{K}_1$ is the PRF key
domain and $\mathcal{I}$ is the file identifier domain.

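• Mac($k$, id, $e$) $\to t$: The MAC tag $t \in \mathbb{F}_q$ of a source
block or encoded block, denoted by $e \in \mathbb{F}_q^{n+m}$,
under key $k$, can be computed by the following
steps:

- $r \leftarrow (F_1(k, \mathrm{id}, 1), \dots, F_1(k, \mathrm{id}, n+m))$.

- $t \leftarrow e \cdot r \in \mathbb{F}_q$.

• Combine($(e_1, t_1, \alpha_1), \dots, (e_\ell, t_\ell, \alpha_\ell)$) $\to t$: The tag
$t \in \mathbb{F}_q$ of $e \stackrel{\mathrm{def}}{=} \sum_{i=1}^{\ell} \alpha_i e_i \in \mathbb{F}_q^{n+m}$ is computed as
follows:

- $t \leftarrow \sum_{i=1}^{\ell} \alpha_i t_i \in \mathbb{F}_q$.

• Verify($k$, id, $e$, $t$) $\to$ {0, 1}: To verify if $t$ is a valid tag
of $e$ under key $k$, we do the following:

- $r \leftarrow (F_1(k, \mathrm{id}, 1), \dots, F_1(k, \mathrm{id}, n+m))$.

- $t' \leftarrow e \cdot r$.

- If $t' = t$, output 1 (accept); otherwise, output
0 (reject).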
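
The three algorithms can be sketched as follows. This is an illustrative sketch only: the PRF $F_1$ is instantiated here with HMAC-SHA256 reduced modulo a prime, and the field size `Q`, the function names, and the toy blocks are assumptions that do not come from the authors' Java implementation. The modular reduction introduces a small bias that a careful implementation would need to address.

```python
import hashlib
import hmac

Q = 2**61 - 1  # a Mersenne prime, assumed as the field size for this sketch


def prf(key: bytes, file_id: bytes, index: int) -> int:
    """Stand-in for F_1(k, id, index): HMAC-SHA256 output reduced mod Q."""
    msg = file_id + index.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big") % Q


def mac(key: bytes, file_id: bytes, block) -> int:
    """t = e . r, where r = (F_1(k, id, 1), ..., F_1(k, id, n + m))."""
    r = [prf(key, file_id, i) for i in range(1, len(block) + 1)]
    return sum(x * y for x, y in zip(block, r)) % Q


def combine(triples) -> int:
    """Tag of sum_i alpha_i e_i computed from (e_i, t_i, alpha_i) without the key."""
    return sum(alpha * t for _, t, alpha in triples) % Q


def verify(key: bytes, file_id: bytes, block, tag: int) -> bool:
    return mac(key, file_id, block) == tag


key, fid = b"demo key", b"file-1"
e1, e2 = [5, 7, 11, 1, 0], [2, 3, 4, 0, 1]      # toy augmented blocks (n = 3, m = 2)
t1, t2 = mac(key, fid, e1), mac(key, fid, e2)

a1, a2 = 6, 9
e = [(a1 * x + a2 * y) % Q for x, y in zip(e1, e2)]
t = combine([(e1, t1, a1), (e2, t2, a2)])
print(verify(key, fid, e, t))                    # combined tag verifies

tampered = [(e[0] + 1) % Q] + e[1:]
print(verify(key, fid, tampered, t))             # a modified block is rejected
```

The point of the sketch is the homomorphism: `combine` never touches the key, yet its output equals the tag that `mac` would have produced for the linear combination, because the inner product with `r` is linear.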

Lemma 1 (Theorem 1 in [20]). Assume that $F_1$ is a secure
PRF. For any fixed $q$, $n$, $m$, SpaceMac is a secure $(q, n, m)$
homomorphic MAC scheme.

We refer the reader to [20] for the security game
and proof of SpaceMac. We provide a security proof of
SpaceMac when used in NC-Audit in Section 6.1. If the
user computes the verification tags for the source blocks
using Mac, then the storage node can compute a valid
MAC tag for any encoded block using Combine. The
security of SpaceMac guarantees that if a block, $e'$, is
not a linear combination of the source blocks, then the
storage node can only forge a valid MAC tag for $e'$
with probability $1/q$. The security when using $\ell$ tags is
improved to $1/q^{\ell}$. For clarity, we focus on a single file $\mathcal{F}$
and thus omit the file identifier id used by the above
three algorithms in our subsequent discussion.

4.4 The Random Linear Encryption: NCrypt 

To protect the privacy of the response file block, we need
to encrypt it. The encryption, however, needs to still
allow for the verification of the block. Here, we describe
NCrypt, an encryption scheme that is compatible with
SpaceMac. In particular, NCrypt will protect $n-1$ elements
of the response block while still allowing SpaceMac
integrity checking. Only $n-1$ elements rather than $n$ are
protected because of a technical constraint needed
to preserve the security guarantee of SpaceMac (shown
in the proof of the subsequent Theorem 4).

Let $\bar{x}$ denote the vector formed by the first $n-1$
elements of vector $x$. The construction of NCrypt uses
two PRFs: $F_2 : \mathcal{K}_2 \times ([1, n-1] \times [1, n-1]) \to \mathbb{F}_q$ and
$F_3 : \mathcal{K}_2 \times (\{0,1\}^{\lambda} \times [1, n-1]) \to \mathbb{F}_q$, where $\mathcal{K}_2$ is a
PRF key domain. NCrypt consists of three probabilistic
polynomial time algorithms:

• Setup($k$, $\bar{r}$) $\to$ ($p_1, \dots, p_{n-1}$): This algorithm is run
by the user to set up the encryption scheme. It takes
as input a secret key $k$ and a vector $\bar{r} \neq 0$, $\bar{r} \in \mathbb{F}_q^{n-1}$.
It outputs $n-1$ elements in $\mathbb{F}_q$, which are called
tagging elements and are used by the encryption.
The details are as follows:

- $\mathbf{p}_i \leftarrow (F_2(k, i, 1), \dots, F_2(k, i, n-1)) \in \mathbb{F}_q^{n-1}$, for
$i \in [1, n-1]$.

- $p_i \leftarrow \bar{r} \cdot \mathbf{p}_i \in \mathbb{F}_q$, for $i \in [1, n-1]$.

• Enc($k$, $\bar{e}$, ($p_1, \dots, p_{n-1}$)) $\to$ ($\bar{c}$, ($r$, $p$)): This algorithm
is run by the storage node to encrypt the first $n-1$
elements of the aggregated response block. It takes
as input a secret key, $k$, the vector formed by the first
$n-1$ elements of the response block, $\bar{e}$, and the tagging
elements, $p_1, \dots, p_{n-1}$. It computes the encryption,
($\bar{c}$, ($r$, $p$)), of $\bar{e}$ as follows:

- Compute $\mathbf{p}_i$, $i \in [1, n-1]$, using key $k$ as in Setup.

- Choose $r$ uniformly at random: $r \leftarrow \{0,1\}^{\lambda}$.

- Compute the masking coefficients:
$\beta_i \leftarrow F_3(k, r, i) \in \mathbb{F}_q$, for $i \in [1, n-1]$.

- Compute the masking vector:
$\bar{m} \leftarrow \sum_{i=1}^{n-1} \beta_i \mathbf{p}_i \in \mathbb{F}_q^{n-1}$.

- Compute $\bar{c} \leftarrow \bar{e} + \bar{m} \in \mathbb{F}_q^{n-1}$.

- Compute $p \leftarrow \sum_{i=1}^{n-1} \beta_i p_i \in \mathbb{F}_q$.

In essence, the data is masked with a randomly
chosen vector $\bar{m} \in \mathrm{span}(\mathbf{p}_1, \dots, \mathbf{p}_{n-1})$.

• Dec($k$, ($\bar{c}$, ($r$, $p$))) $\to \bar{e}$: This algorithm takes as input
a secret key, $k$, and the ciphertext, ($\bar{c}$, ($r$, $p$)). The
decryption is done as follows:

- Compute $\mathbf{p}_i$, $i \in [1, n-1]$, using key $k$ as in Setup.

- Compute $\beta_i \leftarrow F_3(k, r, i) \in \mathbb{F}_q$, for $i \in [1, n-1]$.

- Compute $\bar{m} \leftarrow \sum_{i=1}^{n-1} \beta_i \mathbf{p}_i \in \mathbb{F}_q^{n-1}$.

- Compute $\bar{e} \leftarrow \bar{c} - \bar{m} \in \mathbb{F}_q^{n-1}$.

Theorem 2. Assume that $F_2$ and $F_3$ are secure PRFs. Then
NCrypt is a fixed-length private-key encryption scheme for
messages of length $(n-1) \times \log_2 q$ that has indistinguishable
encryptions under a chosen-plaintext attack.

Proof. Intuitively, the security of NCrypt holds because
$\bar{m}$ looks completely random to an adversary who
observes a ciphertext ($\bar{c}$, ($r$, $p$)), since it is computationally
difficult for the adversary to compute the $\mathbf{p}_i$'s and $\beta_i$'s
without knowing the secret key $k$. The proof follows the
technique used to prove the security of Construction 3.24
in [37].

We follow the notation in [37]. Denote the CPA
security experiment of an encryption scheme $\Pi$ =
(Setup, Enc, Dec) and an adversary $A$ by $\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi}$. The
game is as follows:

• A key $k$ is chosen uniformly at random from $\{0,1\}^{\lambda}$.

• The adversary $A$ is given $\bar{r}$, $p_1, \dots, p_{n-1}$, and oracle
access to $\mathrm{Enc}_k$. It outputs a pair of messages $m_0$ and
$m_1$, both in $\mathbb{F}_q^{n-1}$.

• A random bit $b \leftarrow \{0,1\}$ is chosen, and then a ciphertext
$c \leftarrow \mathrm{Enc}(k, m_b, (p_1, \dots, p_{n-1}))$ is computed
and given to $A$. We call $c$ the challenge ciphertext.

• The adversary $A$ continues to have oracle access to
$\mathrm{Enc}_k$, and outputs a bit $b'$.

• The output of the experiment is defined to be 1 if
$b' = b$, and 0 otherwise. In case $\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi} = 1$, we
say that $A$ succeeded.

Let $\Pi_1$ be an encryption scheme that is exactly the
same as $\Pi$ except that a truly random function $f_2$ is used
in place of $F_2$. Let Adv[$B$, $F_2$] be the probability of an
adversary $B$ with similar runtime to $A$ winning the PRF
security game. We have

$$\left| \Pr[\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi} = 1] - \Pr[\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi_1} = 1] \right| = \mathrm{Adv}[B, F_2] \qquad (1)$$

Similarly, let $\Pi_2$ be an encryption scheme that is exactly
the same as $\Pi_1$ except that a truly random function
$f_3$ is used in place of $F_3$. Let Adv[$C$, $F_3$] be the probability
of an adversary $C$ with similar runtime to $A$ winning the
PRF security game:

$$\left| \Pr[\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi_1} = 1] - \Pr[\mathrm{PrivK}^{\mathrm{cpa}}_{A,\Pi_2} = 1] \right| = \mathrm{Adv}[C, F_3] \qquad (2)$$

We claim that for every adversary A that makes at 
most g{X) queries to its encryption oracle {g is a pol5n-io- 
mial function), we have 

1 5(A) 

2 2^' 
Let $r_c$ denote the random string used when generating
the challenge ciphertext. There are two cases:

(a) $r_c$ is never used by the oracle to answer any
of $\mathcal{A}$'s queries: Parse $\bar{\mathbf{m}}$ as
$(m^{(1)}, \dots, m^{(n-1)})$, and $\bar{\mathbf{p}}_i$ as $(p_i^{(1)}, \dots, p_i^{(n-1)})$. From a
ciphertext returned from an oracle query, the adversary
can construct the following system of equations by
subtracting the query plaintext from the ciphertext:

$$\begin{aligned}
\beta_1 p_1^{(1)} + \cdots + \beta_{n-1} p_{n-1}^{(1)} &= m^{(1)} \\
&\;\;\vdots \\
\beta_1 p_1^{(n-1)} + \cdots + \beta_{n-1} p_{n-1}^{(n-1)} &= m^{(n-1)}
\end{aligned}$$

Note that the $p_i^{(j)}$ are not all zeros w.h.p. Let $\beta_i$ be
unknowns, $i \in [1, n-1]$. The above system of linear
equations is consistent regardless of the values of the $m^{(j)}$'s
since the rank of the coefficient matrix is at most $n-1$,
which is the number of unknowns. Also note that the
knowledge of $\bar{\mathbf{r}}$ and $p_1, \dots, p_{n-1}$ does not contribute any
additional equations w.r.t. $\beta_i$ to the above system. Let $s$
be the rank of the coefficient matrix.

Now for any $w \in [1, n-1]$, assume that all $m^{(j)}$, $j \neq w$, $j \in
[1, n-1]$, are fixed. Then $m^{(w)}$ can still take any value
in $\mathbb{F}_q$ equally likely because (i) for any value of $m^{(w)}$,
there is the same number of solutions, which is $q^{n-1-s}$,
and (ii) the $\beta_i$ are chosen uniformly at random from $\mathbb{F}_q$.
Thus, each element of the plaintext, $e^{(w)}$, is masked with
a uniformly random value, $m^{(w)}$, independent of the other
masking elements $m^{(j)}$, $j \neq w$, $j \in [1, n-1]$. Therefore, the
probability that $\mathcal{A}$ outputs $b' = b$ is exactly 1/2, as in
the case of the one-time pad.

(b) $r_c$ is used by the oracle to answer at least one of $\mathcal{A}$'s
queries: In this case, $\mathcal{A}$ may easily determine which of
its messages was encrypted. This is because whenever
the oracle returns a ciphertext, $(\bar{\mathbf{c}}, (r,p))$, the adversary learns the
masking vector $\bar{\mathbf{m}}$ associated with $r$ since $\bar{\mathbf{m}} = \bar{\mathbf{c}} - \bar{\mathbf{e}}$.
Since $\mathcal{A}$ makes at most $g(\lambda)$ queries, and $r$ is chosen
uniformly at random, the probability of this event is at
most $g(\lambda)/2^{\lambda}$.

Equation (3) follows from (a) and (b). Equations (1),
(2), and (3) prove the theorem. □



4.5 The Privacy-Preserving Auditing Scheme: NC-Audit

Now we are ready to describe our symmetric-key based 
auditing protocol, denoted by NC-Audit. In particular, 
NC-Audit is built from SpaceMac and NCrypt as follows: 

Setup phase: 

• The user divides the file into $m$ blocks of size $n-1$
instead of $n$ and pads each block with a random
element in $\mathbb{F}_q$. This is necessary as NCrypt encrypts
only the first $n-1$ elements. We still denote
each padded block with its coding coefficients by
$\mathbf{b}_i$, $i \in [1, m]$.

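• The user runs KeyGen to generate the MAC key, $k_1$,
and the encryption key, $k_2$:

- KeyGen$(1^{\lambda}) \to (k_1, k_2)$: $k_1, k_2 \xleftarrow{R} \{0,1\}^{\lambda}$.

• The user then sets up the encryption scheme by
computing the tagging elements, $p_1, \dots, p_{n-1}$:

- $\bar{\mathbf{r}} \leftarrow (F_1(k_1, 1), \dots, F_1(k_1, n-1))$.

- $(p_1, \dots, p_{n-1}) \leftarrow \mathrm{Setup}(k_2, \bar{\mathbf{r}})$.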

• Afterward, the user computes a tag for each source
block $\mathbf{b}_i$ using the Mac algorithm of SpaceMac:

- $t_{\mathbf{b}_i} = \mathrm{Mac}(k_1, \mathbf{b}_i)$.

• The user computes MAC tags of encoded blocks
using the Combine algorithm of SpaceMac. Assume
$\mathbf{e} = \sum_{i=1}^{m} \alpha_i \mathbf{b}_i$; then its tag is computed as follows:

- TagGen$(\mathbf{e}, k_1) \to t_{\mathbf{e}} = \sum_{i=1}^{m} \alpha_i t_{\mathbf{b}_i}$.

• Finally, the user sends the encoded blocks,
$\mathbf{e}_1, \dots, \mathbf{e}_M$, their tags, $t_{\mathbf{e}_1}, \dots, t_{\mathbf{e}_M}$, the tagging
elements, $p_1, \dots, p_{n-1}$, and the encryption key, $k_2$, to
the storage node. The user also sends the coding
coefficients, $\mathrm{aug}(\mathbf{e}_1), \dots, \mathrm{aug}(\mathbf{e}_M)$, and the MAC key,
$k_1$, to the TPA. We assume that the user uses private
and authentic channels to send $k_1$ and $k_2$ while
using an authentic channel for sending the other
data. The user then keeps the coding coefficients (for
repair) and the keys but deletes all other data.
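
The tagging steps of the Setup phase rely on the MAC being linear, so that the tag of a coded block is the same linear combination of the source tags. A minimal Python sketch of that property follows. It assumes a prime field instead of the paper's $\mathbb{F}_{2^8}$ and HMAC-SHA256 as a stand-in for $F_1$; the function names are illustrative, not SpaceMac's actual API.

```python
import hashlib
import hmac

Q = 2**31 - 1  # assumption: a prime field; the paper's implementation uses F_{2^8}

def prf(key: bytes, *parts) -> int:
    """Stand-in for F1 (assumption): HMAC-SHA256 reduced mod Q."""
    digest = hmac.new(key, repr(parts).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % Q

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def mac(k1: bytes, block):
    """Tag t = b . r with r = (F1(k1, 1), ..., F1(k1, n + m))."""
    r = [prf(k1, "F1", i) for i in range(1, len(block) + 1)]
    return dot(block, r)

def augment(data, index: int, m: int):
    """Append the unit coefficient vector of source block `index` (1-based)."""
    return list(data) + [1 if j == index else 0 for j in range(1, m + 1)]

def combine(alphas, blocks):
    """Coded block e = sum_i alpha_i * b_i."""
    return [dot(alphas, column) for column in zip(*blocks)]

def tag_gen(alphas, source_tags):
    """Tag of the coded block, computed from source tags only."""
    return dot(alphas, source_tags)
```

For example, with `src = [augment([1, 2, 3], 1, 2), augment([4, 5, 6], 2, 2)]` and coefficients `[7, 9]`, `mac(k1, combine([7, 9], src))` equals `tag_gen([7, 9], [mac(k1, b) for b in src])`, so the user never needs to tag coded blocks from scratch.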

Audit phase: 

• The TPA chooses a set of indexes of blocks to be
audited, $I \subseteq [1, M]$, and chooses the coefficients for
these blocks uniformly at random: $\alpha_i \xleftarrow{R} \mathbb{F}_q$, $i \in I$.
The challenge includes the indexes of the blocks
and their corresponding coefficients:

- Prepare $\mathrm{chal} = \{(i, \alpha_i) \mid i \in I\}$.

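• GenProof, run by the storage node to generate the
proof of storage, $V$, is implemented as follows:

- Compute the aggregated block:
$\mathbf{e} = \sum_{i \in I} \alpha_i \mathbf{e}_i$. Parse $\mathbf{e}$ as $(\bar{\mathbf{e}}, e^{(n)})$.

- Compute the aggregated tag:
$t = \sum_{i \in I} \alpha_i t_{\mathbf{e}_i}$.

- Encrypt the response block:
$(\bar{\mathbf{c}}, (r,p)) \leftarrow \mathrm{Enc}(k_2, \bar{\mathbf{e}}, (p_1, \dots, p_{n-1}))$.

The node then sends $V = ((\bar{\mathbf{c}}, (r,p)), e^{(n)}, t)$ back to
the TPA.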

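• VerifyProof, run by the TPA to verify the proof $V$, is
implemented as follows:

- Compute the coefficients of $\mathbf{e}$:
$\mathrm{aug}(\mathbf{e}) = \sum_{i \in I} \alpha_i \, \mathrm{aug}(\mathbf{e}_i)$.

- Let $\mathbf{c} = (\bar{\mathbf{c}} \mid e^{(n)} \mid \mathrm{aug}(\mathbf{e}))$, where "|" denotes
augmentation. Return the result of Verify$(k_1, \mathbf{c}, t + p)$.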

Correctness. The correctness of NC-Audit is guaranteed
by the following theorem. Its security is proved in
Section 6.

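Theorem 3. If the storage node follows NC-Audit and
computes the aggregated response block using uncorrupted blocks,
then the TPA will accept the proof.

Proof. Let $\mathbf{r} = (F_1(k, \mathrm{id}, 1), \dots, F_1(k, \mathrm{id}, n+m))$. Note
that

$$\mathbf{c} = (\bar{\mathbf{c}} \mid e^{(n)} \mid \mathrm{aug}(\mathbf{e})) = ((\bar{\mathbf{e}} + \bar{\mathbf{m}}) \mid e^{(n)} \mid \mathrm{aug}(\mathbf{e})) = \mathbf{e} + (\bar{\mathbf{m}} \mid 0, \dots, 0).$$

Thus, in Verify,

$$t' = \mathbf{c} \cdot \mathbf{r} = \mathbf{e} \cdot \mathbf{r} + \bar{\mathbf{m}} \cdot \bar{\mathbf{r}} = t + \sum_{i=1}^{n-1} \beta_i \, \bar{\mathbf{p}}_i \cdot \bar{\mathbf{r}} = t + \sum_{i=1}^{n-1} \beta_i p_i = t + p.$$

Therefore, Verify returns 1. Hence, the TPA accepts the
proof. □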
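
The identity $\mathbf{c} \cdot \mathbf{r} = t + p$ can be checked numerically with a toy end-to-end audit. The sketch below is an illustration under several assumptions that are not the paper's choices: a prime field rather than $\mathbb{F}_{2^8}$, HMAC-SHA256 in place of $F_1$, $F_2$, $F_3$, tiny parameters, and nonzero challenge coefficients (the paper samples from all of $\mathbb{F}_q$) so that the corrupted case is detected except with negligible probability.

```python
import hashlib
import hmac
import secrets

Q = 2**31 - 1      # assumption: prime field; the paper's implementation uses F_{2^8}
N, M_SRC = 6, 3    # toy block length n and number of source blocks m

def prf(key, *parts):
    """Stand-in PRF (assumption): HMAC-SHA256 reduced mod Q."""
    digest = hmac.new(key, repr(parts).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % Q

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def lin_comb(alphas, vectors):
    return [dot(alphas, column) for column in zip(*vectors)]

def p_bar(k2, i):
    return [prf(k2, "F2", i, j) for j in range(N - 1)]

def ncrypt_enc(k2, e_bar, p_tags):
    nonce = secrets.token_bytes(10)
    betas = [prf(k2, "F3", nonce, i) for i in range(1, N)]
    m_bar = lin_comb(betas, [p_bar(k2, i) for i in range(1, N)])
    c_bar = [(e + m) % Q for e, m in zip(e_bar, m_bar)]
    return c_bar, (nonce, dot(betas, p_tags))

def run_audit(corrupt=False):
    # Setup phase (user)
    k1, k2 = secrets.token_bytes(16), secrets.token_bytes(16)
    r_vec = [prf(k1, "F1", i) for i in range(1, N + M_SRC + 1)]
    p_tags = [dot(p_bar(k2, i), r_vec[:N - 1]) for i in range(1, N)]
    source = [[secrets.randbelow(Q) for _ in range(N)]
              + [int(j == i) for j in range(M_SRC)] for i in range(M_SRC)]
    src_tags = [dot(b, r_vec) for b in source]               # Mac
    mix = [[secrets.randbelow(Q) for _ in range(M_SRC)] for _ in range(4)]
    coded = [lin_comb(a, source) for a in mix]               # kept by the node
    tags = [dot(a, src_tags) for a in mix]                   # TagGen, kept by the node
    aug = [e[N:] for e in coded]                             # kept by the TPA

    # Audit phase: challenge (nonzero coefficients, see lead-in)
    chal = [(i, 1 + secrets.randbelow(Q - 1)) for i in (0, 2, 3)]
    if corrupt:
        coded[2][0] = (coded[2][0] + 1) % Q                  # node's data is damaged

    # GenProof (storage node)
    alphas = [a for _, a in chal]
    e = lin_comb(alphas, [coded[i] for i, _ in chal])
    t = dot(alphas, [tags[i] for i, _ in chal])
    proof = (ncrypt_enc(k2, e[:N - 1], p_tags), e[N - 1], t)

    # VerifyProof (TPA): never decrypts, only checks c . r == t + p
    (c_bar, (_nonce, p)), e_n, t = proof
    aug_e = lin_comb(alphas, [aug[i] for i, _ in chal])
    c = c_bar + [e_n] + aug_e
    return dot(c, r_vec) == (t + p) % Q
```

Calling `run_audit()` models an honest node and returns `True`; `run_audit(corrupt=True)` changes one stored symbol after the tags were issued and is rejected.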

5 Repair and Data Dynamics 

Here, we discuss how NC-Audit efficiently supports the 
repair of a failed node as well as changes to the data 
made by the user. 

5.1 Support for Node Repair 

When there is a node failure, the user creates a new node 
to replace this node. Based on the coding coefficients 
of the coded blocks at the remaining healthy nodes, 
the user instructs the healthy nodes to send appropriate 
coded blocks to the new node. The new node then lin- 
early combines them, according to the user instruction, 
to construct its own coded blocks. This new node may 
construct the same coded blocks that the failed node
had (exact repair), or completely different coded blocks
(functional repair) [1].

Using NC-Audit, the verification tags of the newly 
constructed blocks at the new node do not need to be 
computed by the user. In particular, the healthy nodes 
can send along the verification tags of the coded blocks 
that they send to the new node. The new node can use 
Combine to generate tags corresponding to the coded 
blocks that it needs to construct. Finally, the user sends 
the coding coefficients of the coded blocks at the newly 
constructed node to the TPA so that it can audit this
new node. As a result, with NC-Audit, there is negligible
cost, in terms of both bandwidth and computation
of verification metadata, to the user when repairing a
failed node. This stands in stark contrast with the prior
integrity checking scheme for NC-based storage [18],
which requires the user to download many data blocks
(equal to the repair bandwidth) and compute security
metadata for the newly coded blocks herself.

Last but not least, since the TPA audits the new node
based on the new set of coefficients, a malicious node
cannot carry out a replay attack [18] (discussed in Section
2.2); otherwise, it will not pass the audit. Furthermore,
we assume that the healthy remaining nodes send valid
data and tags to the new node. If there is a malicious
node that sends corrupted data or tags, the storage
system is considered polluted. Dealing with pollution
attacks is out of the scope of this paper; we refer the
reader to previous work which explicitly combats
pollution attacks p^\, pSl-pSl, pO), ISl], ||35l, p).



5.2 Support for Data Dynamics 

Next, we discuss how NC-Audit supports changes that 
the user may want to make to their outsourced data, 
including block update, append, insert, and delete - with 
the first two operations generally considered the most 
important operations for NC-based storage. We stress 
that how each change is carried out by the storage cloud 
is dependent on the coding scheme used by the cloud 
as well as how the cloud is designed. In other words, 
the changes of data itself are orthogonal to and out of 
the scope of this work. Similar to [12], we focus on how
the security metadata can be maintained correctly and 
efficiently when using our scheme, regardless of how 
the data is changed. 

Block Update. Assume the user wants to update the
source block, $\mathbf{b}_j$, for some $j \in [1, m]$. Denote the new
block after the update by $\mathbf{b}'_j$. The user first needs to learn the
tag of $\mathbf{b}_j$, which can be done as follows: Assume $\mathbf{b}_j =
\sum_{i=1}^{m} \alpha_i \mathbf{e}_i$; then $t_{\mathbf{b}_j} = \sum_{i=1}^{m} \alpha_i t_{\mathbf{e}_i}$. For $\alpha_i \neq 0$, the user
can download $t_{\mathbf{e}_i}$ from the appropriate storage nodes to
compute $t_{\mathbf{b}_j}$.

The user then computes the tag $t_{\mathbf{b}'_j}$ of $\mathbf{b}'_j$ under key
$k_1$ using Mac. Finally, it sends $t_{\mathbf{b}'_j}$ and $t_{\mathbf{b}_j}$ to the TPA
using an authentic and secure channel. Subsequently,
whenever challenging a storage node and obtaining
a response block which involves $\alpha_j \mathbf{b}_j$, the TPA runs
VerifyProof with the tag $t + \alpha_j (t_{\mathbf{b}'_j} - t_{\mathbf{b}_j})$ instead of $t$. To
see why this is the case, let $\mathbf{e} = \alpha_j \mathbf{b}_j + \sum_{i=1,\dots,m;\, i \neq j} \alpha_i \mathbf{b}_i$
be the aggregated response block (before encryption).
Its corresponding tag that is sent back with the proof
of possession is $t = \alpha_j t_{\mathbf{b}_j} + \sum_{i=1,\dots,m;\, i \neq j} \alpha_i t_{\mathbf{b}_i}$. But
since $\mathbf{b}_j$ is now updated, the correct tag must be $t' =
\alpha_j t_{\mathbf{b}'_j} + \sum_{i=1,\dots,m;\, i \neq j} \alpha_i t_{\mathbf{b}_i}$. Note that if $\mathbf{e}$ is not updated
correctly then by the security guarantee of SpaceMac,
w.h.p. $t'$ is not a valid tag for $\mathbf{e}$. Subsequent updates to
this $j$-th block can be carried out similarly but without
recomputing $t_{\mathbf{b}_j}$.
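
The tag correction applied by the TPA can be illustrated with a short Python sketch. It assumes a prime field and uses random field elements as a stand-in for the PRF-derived MAC vector; `adjusted_tag` and `demo_update` are hypothetical names introduced only for this example.

```python
import secrets

Q = 2**31 - 1  # assumption: prime field; the paper's implementation uses F_{2^8}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def lin_comb(alphas, vectors):
    return [dot(alphas, column) for column in zip(*vectors)]

def adjusted_tag(t, alpha_j, t_old, t_new):
    """Tag used by the TPA after block j is updated: t + alpha_j * (t_new - t_old)."""
    return (t + alpha_j * (t_new - t_old)) % Q

def demo_update(n=4, m=3, j=1):
    r = [secrets.randbelow(Q) for _ in range(n + m)]     # stand-in for F1 outputs
    src = [[secrets.randbelow(Q) for _ in range(n)]
           + [int(i == k) for k in range(m)] for i in range(m)]
    old_tags = [dot(b, r) for b in src]
    new_block = [secrets.randbelow(Q) for _ in range(n)] + src[j][n:]
    t_new = dot(new_block, r)
    alphas = [secrets.randbelow(Q) for _ in range(m)]
    stale_t = dot(alphas, old_tags)                      # tag the node still returns
    updated = src[:j] + [new_block] + src[j + 1:]
    e_new = lin_comb(alphas, updated)                    # response over updated data
    return dot(e_new, r) == adjusted_tag(stale_t, alphas[j], old_tags[j], t_new)
```

`demo_update()` returns `True`: the stale aggregated tag plus the correction term verifies against the response computed from the updated block.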



This approach requires the TPA to store a verification
tag for every updated source block, which is $O(m)$. This
overhead is negligible compared to the outsourced data,
$O((n+m)MN)$, where $n \gg m$. Finally, we assume that
the storage nodes send back correct tags. If one wants
to consider a stronger threat model where the storage
nodes may send back bogus tags, then there are two
possible solutions: (i) modifying the auditing scheme to
require the user to store the source tags, $t_{\mathbf{b}_i}$; in this
case, the additional client storage overhead is $O(m)$; or
(ii) a traditional MAC scheme computed on the coding
coefficient, $\mathrm{aug}(\mathbf{e}_i)$, and verification tag, $t_{\mathbf{e}_i}$, can be used
to protect the integrity of the tag.

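Block Append. Assume that the user wants to append
a source block, $\mathbf{b}_*$, to the system. The encoded $\mathbf{b}_*$ has
the following form:

$$\mathbf{b}_* = (\bar{\mathbf{b}}_*, \underbrace{0, \dots, 0}_{m}, 1) \in \mathbb{F}_q^{n+m+1}.$$

The user first computes the tag $t_{\mathbf{b}_*}$ of $\mathbf{b}_*$ under $k_1$ using
Mac (now for vectors of size $n+m+1$) as follows:

- $\mathbf{r} \leftarrow (F_1(k, 1), \dots, F_1(k, n+m+1))$.

- $t_{\mathbf{b}_*} \leftarrow \mathbf{b}_* \cdot \mathbf{r} \in \mathbb{F}_q$.

It then sends $t_{\mathbf{b}_*}$ to all storage nodes that have coded
packets that involve $\mathbf{b}_*$.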

Note that when an append happens, the vector
representation of a previous source block, $\mathbf{b}_i$, $i \in [1, m]$,
is appended with a zero. However, its verification tag,
computed using Mac, remains the same since $0 \times F_1(k, n+m+1) = 0$.
Consequently, for coded packets that do not
involve $\mathbf{b}_*$, their tags remain the same: if $\mathbf{e} = \sum_{i=1}^{m} \alpha_i \mathbf{b}_i$,
then its new tag equals its old tag: $t'_{\mathbf{e}} = t_{\mathbf{e}} = \sum_{i=1}^{m} \alpha_i t_{\mathbf{b}_i}$.
For coded packets that involve $\mathbf{b}_*$, the storage node can
compute their new tags using $t_{\mathbf{b}_*}$: assume $\alpha_* \mathbf{b}_*$ is
added to $\mathbf{e}$; then $t_{\mathbf{e}'} = t_{\mathbf{e}} + \alpha_* t_{\mathbf{b}_*}$.

Afterwards, the user must send the new coding
coefficients, $\mathrm{aug}(\mathbf{e}_1), \dots, \mathrm{aug}(\mathbf{e}_M)$, to the TPA. Note that
how the system updates its set of coding coefficients
depends on how the underlying coding scheme handles
block append. Finally, since the TPA carries out audits
using this new set of coefficients, if the storage node
does not update its data and tag correctly, it will not
pass the subsequent audits. In particular, since the TPA
computes $\mathrm{aug}(\mathbf{e})$ in VerifyProof locally, if the response
block $\mathbf{e}$ (before encryption) is not updated correctly, then in
the proof of Theorem 3, $\mathbf{c} \neq \mathbf{e} + (\bar{\mathbf{m}} \mid 0, \dots, 0)$. Thus, by
the security guarantee of SpaceMac, VerifyProof will fail
w.h.p.
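
Both facts used for append (old tags survive the zero-extension, and a coded block that absorbs $\alpha_* \mathbf{b}_*$ gets tag $t_{\mathbf{e}} + \alpha_* t_{\mathbf{b}_*}$) can be checked with a small Python sketch. It assumes a prime field and random values in place of the PRF outputs; the helper names are illustrative only.

```python
import secrets

Q = 2**31 - 1  # assumption: prime field; the paper's implementation uses F_{2^8}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def pad_old_block(block):
    """Existing blocks gain one zero coefficient for the appended source block."""
    return list(block) + [0]

def appended_block(data, m):
    """b_* = (data, 0, ..., 0, 1) with m zeros before the final 1."""
    return list(data) + [0] * m + [1]

def tag_after_append(t_e, alpha_star, t_b_star):
    return (t_e + alpha_star * t_b_star) % Q

def demo_append(n=4, m=3):
    r_old = [secrets.randbelow(Q) for _ in range(n + m)]
    r_new = r_old + [secrets.randbelow(Q)]               # adds F1(k, n + m + 1)
    e = [secrets.randbelow(Q) for _ in range(n + m)]     # an existing coded block
    t_e = dot(e, r_old)
    unchanged = dot(pad_old_block(e), r_new) == t_e      # 0 * F1(k, n+m+1) = 0
    b_star = appended_block([secrets.randbelow(Q) for _ in range(n)], m)
    t_b_star = dot(b_star, r_new)
    alpha_star = secrets.randbelow(Q)
    e_new = [(x + alpha_star * y) % Q for x, y in zip(pad_old_block(e), b_star)]
    return unchanged and dot(e_new, r_new) == tag_after_append(t_e, alpha_star, t_b_star)
```

`demo_append()` returns `True`, which mirrors why the storage node can refresh tags on its own once it receives $t_{\mathbf{b}_*}$.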

Block Insert. Assume that the user wants to insert a
new source block, $\mathbf{b}_*$, before a source block $\mathbf{b}_j$, $j \in [1, m]$.
After an insertion, a previous source block, $\mathbf{b}_i$, $i \geq j$, will
have the following form:

$$\mathbf{b}_i = (\bar{\mathbf{b}}_i, \underbrace{0, \dots, 0, 1, 0, \dots, 0}_{m+1}) \in \mathbb{F}_q^{n+m+1},$$

where the 1 is in position $i+1$ of the coefficient part.
Since $\mathbf{b}_i$'s coefficient 1 is shifted to the right by 1, its new
tag computed by Mac no longer equals its old tag:

$$\bar{\mathbf{r}} \cdot \bar{\mathbf{b}}_i + F_1(k, i+1) \neq \bar{\mathbf{r}} \cdot \bar{\mathbf{b}}_i + F_1(k, i).$$

As a result, a straightforward insertion does not work.

To this end, we take an approach similar to [12], where
a block insert is implemented with a block append and a
mapping. In particular, the block is first appended to the
system using Block Append above; then the user needs to
keep a mapping of the index of the appended block to its
appropriate position. This requires user storage which is
linear in the number of blocks inserted.

Block Delete. We assume that the number of blocks to
be deleted is small relative to the file size. If a large
portion of the file is to be deleted then it is best to rerun
the Setup phase of NC-Audit. Similar to [12], we consider
deletion of a block as changing it to a special block. Thus,
updating the metadata to reflect the deletion can be done
as in the Block Update case.

In summary, when using NC-Audit, the user can up- 
date integrity metadata very efficiently to support data 
dynamics, i.e., without downloading data blocks. 

6 Security Analysis 

6.1 Data Possession Guarantee 

When using SpaceMac in NC-Audit, some information
about $\mathbf{r}$ in the SpaceMac construction is available to
the adversary. In particular, the storage node knows the
following $n-1$ equations: $\bar{\mathbf{p}}_i \cdot \bar{\mathbf{r}} = p_i$, $i \in [1, n-1]$.
The following theorem states that even when these
$n-1$ equations are exposed, SpaceMac is still a secure
homomorphic MAC.

Theorem 4. Assume that $F_1$ is a secure PRF. For any
fixed $q, n, m$, assume that a probabilistic polynomial time
adversary $\mathcal{A}$ knows any $n-1$ linearly independent vectors,
$\bar{\mathbf{p}}_1, \dots, \bar{\mathbf{p}}_{n-1}$, and any $n-1$ constants, $p_1, \dots, p_{n-1}$, such that
$\bar{\mathbf{p}}_i \cdot \bar{\mathbf{r}} = p_i$, where $\mathbf{r}$ is used in the construction of SpaceMac.
The probability that $\mathcal{A}$ wins the SpaceMac security game,
denoted by $\mathrm{Adv}[\mathcal{A}, \mathrm{SpaceMac}]$, is at most

$$\text{PRF-Adv}[\mathcal{B}, F_1] + \frac{1}{q},$$

where $\text{PRF-Adv}[\mathcal{B}, F_1]$ is the probability of an adversary $\mathcal{B}$
with similar runtime to $\mathcal{A}$ winning the PRF security game.

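Proof: The security game of SpaceMac, called Attack Game 1,
involves a challenger $\mathcal{C}$ and an adversary
$\mathcal{A}$, and is as follows:

• Setup. $\mathcal{C}$ generates a random key $k \xleftarrow{R} \mathcal{K}$.

• Queries. $\mathcal{A}$ adaptively queries $\mathcal{C}$, where each query
is of the form $(\mathrm{id}, \mathbf{y})$. For each query, $\mathcal{C}$ replies to $\mathcal{A}$
with the corresponding tag $t \leftarrow \mathrm{Mac}(k, \mathrm{id}, \mathbf{y})$.

• Output. $\mathcal{A}$ eventually outputs a tuple $(\mathrm{id}^*, \mathbf{y}^*, t^*)$.

Up to the time $\mathcal{A}$ outputs, it has queried $\mathcal{C}$ multiple times.
Let $l$ denote the number of times $\mathcal{A}$ queried $\mathcal{C}$ using $\mathrm{id}^*$
and got tags for $l$ vectors, $\mathbf{y}^*_1, \dots, \mathbf{y}^*_l$, of these queries.

We consider that the adversary wins the security game
if and only if

• $(y^{*(n+1)}, \dots, y^{*(n+m)}) \neq \mathbf{0}$ (trivial forge otherwise),

• $\mathrm{Verify}(k, \mathrm{id}^*, \mathbf{y}^*, t^*) = 1$, and

• $\mathbf{y}^* \notin \mathrm{span}(\mathbf{y}^*_1, \dots, \mathbf{y}^*_l)$.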

Here, we prove Theorem 4 with respect to a slightly
different security game, called Attack Game 2. This
Attack Game 2 is similar to Attack Game 1, except that in
the Queries phase, for each distinct id, the space spanned
by the vectors used in the queries has dimension at
most $m$. This Attack Game 2 is stricter but better fits
the reality: since the dimension of the source space $\Pi$ is
only $m$, the adversary must only learn tags of vectors in
spaces having dimensions at most $m$.

The proof is done by using a sequence of games
denoted Game 0 and Game 1. Let $W_0$ and $W_1$ denote the
events that $\mathcal{A}$ wins the homomorphic MAC security game in
Game 0 and Game 1, respectively. Game 0 is identical to
Attack Game 2 applied to the scheme SpaceMac. Hence,

$$\Pr[W_0] = \mathrm{Adv}[\mathcal{A}, \mathrm{SpaceMac}] \tag{4}$$

Game 1 is identical to Game 0 except that the challenger
$\mathcal{C}$ computes $\mathbf{r} \leftarrow (r_1, \dots, r_{n+m})$, where $r_i$ is chosen
uniformly at random from $\mathbb{F}_q$: $r_i \xleftarrow{R} \mathbb{F}_q$ instead of
$r_i \leftarrow F(k, \mathrm{id}, i)$, and everything else remains the same.
Then, there exists a PRF adversary $\mathcal{B}$ such that

$$\left| \Pr[W_0] - \Pr[W_1] \right| = \text{PRF-Adv}[\mathcal{B}, F] \tag{5}$$

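The complete challenger in Game 1 works as follows:

Queries. $\mathcal{A}$ adaptively queries $\mathcal{C}$, where each query is of
the form $(\mathrm{id}, \mathbf{y})$. If id is already used in $m$ previous queries,
$\mathcal{C}$ discards the query. Otherwise, $\mathcal{C}$ replies to query $i$ of
$\mathcal{A}$ as follows:

    if id is never used in any of the previous queries:
        $\mathbf{r}_i := (r_i^{1}, \dots, r_i^{n+m})$, where $r_i^{j} \xleftarrow{R} \mathbb{F}_q$, $j \in [n+m]$
    else:
        $\mathbf{r}_i$ := the one used in the previous response
    send $t := \mathbf{y}_i \cdot \mathbf{r}_i$ to $\mathcal{A}$

Output. $\mathcal{A}$ eventually outputs a tuple $(\mathrm{id}^*, \mathbf{y}^*, t^*)$. When
$\mathbf{y}^*$ does not equal $\mathbf{0}$, to determine if $\mathcal{A}$ wins the game,
we compute

    if $\mathrm{id}^* = \mathrm{id}_i$ (for some $i$) then   // case (i)
        set $\mathbf{r}^* := \mathbf{r}_i$
    else                                   // case (ii)
        set $\mathbf{r}^* := (r^*_1, \dots, r^*_{n+m})$, where $r^*_i \xleftarrow{R} \mathbb{F}_q$, $i \in [n+m]$

Let $l$ denote the number of times $\mathcal{A}$ queried $\mathcal{C}$ using $\mathrm{id}^*$
and got tags for $l$ vectors, $\mathbf{y}^*_1, \dots, \mathbf{y}^*_l$, of these queries.
The adversary wins the game, i.e., event $W_1$ happens, if
and only if

$$t^* = \mathbf{y}^* \cdot \mathbf{r}^*, \text{ and} \tag{6}$$

$$\mathbf{y}^* \notin \mathrm{span}(\mathbf{y}^*_1, \dots, \mathbf{y}^*_l). \tag{7}$$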



We will show that $\Pr[W_1] = \frac{1}{q}$. Let $T$ be the event
that $\mathcal{A}$ outputs a tuple with a completely new $\mathrm{id}^*$, i.e.,
$\mathcal{A}$ never made queries using $\mathrm{id}^*$ before.

• When $T$ happens, i.e., in case (ii), since the $r^*_i$'s
are indistinguishable from random values and
$(y^{*(n+1)}, \dots, y^{*(n+m)}) \neq \mathbf{0}$, the right hand side of
equation (6) is a completely random value in $\mathbb{F}_q$. Thus,

$$\Pr[W_1 \wedge T] = \frac{1}{q} \Pr[T] \tag{8}$$



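• When $T$ does not happen, i.e., in case (i): $\mathbf{r}^*$ of
equation (6) equals $\mathbf{r}_i$ for some $i$, and $\mathbf{r}^*$ has been
used to generate tags for vectors $\mathbf{y}^*_1, \dots, \mathbf{y}^*_l$. In this
case, we proceed by showing that for a fixed $\mathbf{y}^*$, $t^*$
looks indistinguishable from a random value in $\mathbb{F}_q$. Let
$\Pi^* = \mathrm{span}(\mathbf{y}^*_1, \dots, \mathbf{y}^*_l)$, which has dimension $\delta \leq m$. Let
$\{\mathbf{b}_1, \dots, \mathbf{b}_\delta\}$ be a basis of $\Pi^*$. Let $r^*_1, \dots, r^*_{n+m}$ be the
unknowns. The given prior knowledge, the queries, and
the output form the following system of linear equations:

$$\bar{\mathbf{p}}_1 \cdot \bar{\mathbf{r}}^* = p_1; \;\cdots;\; \bar{\mathbf{p}}_{n-1} \cdot \bar{\mathbf{r}}^* = p_{n-1}; \quad \mathbf{y}^*_1 \cdot \mathbf{r}^* = t_{\mathbf{y}^*_1}; \;\cdots;\; \mathbf{y}^*_l \cdot \mathbf{r}^* = t_{\mathbf{y}^*_l}; \quad \mathbf{y}^* \cdot \mathbf{r}^* = t^*.$$

This system is equivalent to the following system:

$$\bar{\mathbf{p}}_1 \cdot \bar{\mathbf{r}}^* = p_1; \;\cdots;\; \bar{\mathbf{p}}_{n-1} \cdot \bar{\mathbf{r}}^* = p_{n-1}; \quad \mathbf{b}_1 \cdot \mathbf{r}^* = t_{\mathbf{b}_1}; \;\cdots;\; \mathbf{b}_\delta \cdot \mathbf{r}^* = t_{\mathbf{b}_\delta}; \quad \mathbf{y}^* \cdot \mathbf{r}^* = t^*.$$

Note that $t_{\mathbf{b}_i}$ is a linear combination
of some $t_{\mathbf{y}^*_j}$. Without loss of generality, assume the $\bar{\mathbf{p}}_i$ and
$\mathbf{b}_i$ are linearly independent (otherwise, the number of
equations the adversary learns would be less). Since the
coefficients of $\mathbf{y}^*$ are not all zeros and $\mathbf{y}^* \notin \Pi^*$, $\mathbf{y}^*$
is linearly independent of the $\bar{\mathbf{p}}_i$ and $\mathbf{b}_i$. Thus, the above
system of $n+m$ unknowns is consistent regardless of the
value of $t^*$ because the coefficient matrix has rank $n+\delta$,
which equals the number of equations. Furthermore, for
any value $t^*$, the solution space always has the same
size $q^{m-\delta}$. Thus, for a fixed $\mathbf{y}^*$, its valid tag $t^*$ could be
any value in $\mathbb{F}_q$ equally likely, given that the $r^*_i$'s are chosen
uniformly at random from $\mathbb{F}_q$. As a result, the probability
that the adversary chooses a correct $t^*$ is $1/q$. Thus,

$$\Pr[W_1 \wedge \neg T] = \frac{1}{q} \Pr[\neg T]. \tag{9}$$

From equations (8) and (9), we have

$$\Pr[W_1] = \Pr[W_1 \wedge T] + \Pr[W_1 \wedge \neg T] = \frac{1}{q}. \tag{10}$$

Equations (4), (5), and (10) together prove the theorem. □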
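
The counting argument in case (i) can be checked by brute force on a toy instance. The following Python sketch enumerates every key vector consistent with what the adversary knows and tallies the tag it would assign to a vector outside the queried span. The field size, dimensions, and concrete vectors are arbitrary choices made for this illustration, not values from the paper.

```python
from collections import Counter
from itertools import product

Q = 5          # toy field size (assumption for illustration)
N, M = 3, 1    # r has N + M = 4 coordinates; r-bar is its first N - 1 = 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def tag_distribution(secret_r, p_bars, queries, y_star):
    """Count, over all r consistent with the adversary's view, the value y* . r."""
    known_p = [dot(pb, secret_r[:N - 1]) for pb in p_bars]   # p_i = p-bar_i . r-bar
    known_t = [dot(y, secret_r) for y in queries]            # tags from queries
    counts = Counter()
    for r in product(range(Q), repeat=N + M):
        if ([dot(pb, r[:N - 1]) for pb in p_bars] == known_p
                and [dot(y, r) for y in queries] == known_t):
            counts[dot(y_star, r)] += 1
    return counts

counts = tag_distribution(
    secret_r=(2, 3, 1, 4),
    p_bars=[(1, 2), (1, 3)],      # n - 1 linearly independent vectors
    queries=[(1, 2, 3, 1)],       # one queried vector, so delta = 1
    y_star=(4, 0, 2, 1),          # outside the span of the queries
)
print(sorted(counts.items()))     # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```

Every candidate tag is supported by the same number of consistent keys, so in this instance a forger cannot do better than guessing with probability $1/q$.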
Now, we are ready to prove the data possession
guarantee of NC-Audit.

Theorem 5. With probability at least $1 - \frac{2}{q}$, the storage node
can pass a check if and only if it possesses the blocks specified
in the challenge of the check.

Proof: Theorem 3 shows that if the storage node
possesses the data then it can pass the check. It remains
to show that if the node passes the check then it
possesses the corresponding blocks w.h.p. Let us prove the
contrapositive, i.e., if there are corrupted or missing blocks,
the node will fail the check w.h.p.

For simplicity, we assume that when responding to
a challenge involving a block that no longer exists in
the storage, the node replaces it with a block chosen
uniformly at random in $\mathbb{F}_q^{n+m}$. Denote the correct,
unencrypted aggregated block by $\mathbf{e}$, i.e., $\mathbf{e} = \sum_{i \in I} \alpha_i \mathbf{e}_i$.
Denote the data of the response block actually computed
by the storage node by $\bar{\mathbf{a}}$ and denote $(\bar{\mathbf{a}} \mid \mathrm{aug}(\mathbf{e}))$ by $\mathbf{a}$.

If there is at least one error in the data of one of
the blocks or there is at least one missing block, then
$\mathrm{Prob}[\mathbf{a} = \mathbf{e}] \leq \frac{1}{q}$ because the $\alpha_i$'s are chosen uniformly at
random from $\mathbb{F}_q$. Note that $\mathbf{e}$ is in the source space:
$\mathbf{e} \in \Pi$; thus if $\mathbf{a} \neq \mathbf{e}$ then $\mathbf{a} \notin \Pi$. Therefore, $\mathrm{Prob}[\mathbf{a} \in \Pi] =
\mathrm{Prob}[\mathbf{a} = \mathbf{e}] \leq \frac{1}{q}$ (a).

Furthermore, the security of SpaceMac from Theorem
4 guarantees that the node can provide a valid tag of
$\mathbf{a} \notin \Pi$ with probability at most $\frac{1}{q}$. Finally, without loss
of generality, we can ignore the encryption because if
the node already knows a valid tag of $\mathbf{a}$, it can provide
the correct encryption to pass the check. Meanwhile, if
the node does not know a valid tag of $\mathbf{a}$, its chance of
forging a valid tag for the ciphertext $\mathbf{c}$ is still bounded
by the security guarantee of SpaceMac, which is at most
$\frac{1}{q}$ (b).

As a result, from (a) and (b), the probability of passing
the check when there is an error or a missing block is at most
$\frac{2}{q}$. □

NC-Audit actually provides a stronger data possession
guarantee. It ensures that the user can extract the data
stored on the storage node just by collecting responses of
the node from the checking protocol. We provide a proof
of retrievability based on the theoretical framework of
[13], which is derived from [10] and [9].



Theorem 6. Assume that the storage node responds correctly
to a fraction $1 - \epsilon$ of challenges uniformly, where $\epsilon < \frac{1}{2}$.
The user can extract $\mathbf{e}_1, \dots, \mathbf{e}_M$ by performing $\gamma$
challenge-response interactions with the storage node with high
probability (depending on $\gamma$, $\epsilon$, and $q$).

Proof: Theorem 5 implies that if a node responds
correctly to a fraction of challenges, then with probability
at least $1 - \frac{2}{q}$, the response block is a correct linear
combination of the blocks stored at the node. For a
challenge coefficient vector $(\alpha_1, \dots, \alpha_M)$, the user can
challenge the node using a number of constant multiples
of the vector, e.g., $(c\alpha_1, \dots, c\alpha_M)$ for some constant $c$,
to learn the responses (including incorrect responses),
and then use majority decoding to learn the correct
equation $\sum_{i=1}^{M} \alpha_i \mathbf{e}_i = \mathbf{d}$, where $\mathbf{d}$ is some constant
vector. By collecting $M$ linearly independent equations
of this form, the user can solve for $\mathbf{e}_1, \dots, \mathbf{e}_M$ using
Gaussian elimination.

Note that for a fixed $\epsilon < \frac{1}{2}$, the probability of learning
one correct equation depends on both $q$ and the number
of queries made using the multiples of the corresponding
coefficient vector. For a fixed $q$, this probability can
be made arbitrarily high by increasing the number of
queries. □
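
The extraction strategy in the proof (majority decoding over scaled challenges, then Gaussian elimination) can be sketched as follows. This is a simplified illustration, not the paper's procedure in full: it works over a small prime field, ignores encryption and tags, uses a simulated node that answers every third challenge incorrectly (so $\epsilon = 1/3$), and picks Vandermonde rows as the linearly independent challenge vectors. All function names are hypothetical. It needs Python 3.8+ for `pow(x, -1, Q)`.

```python
from collections import Counter

Q = 257  # assumption: small prime field for illustration

def make_flaky_node(blocks, lie_every=3):
    """Simulated node that answers every `lie_every`-th challenge incorrectly."""
    state = {"calls": 0}
    def respond(alphas):
        state["calls"] += 1
        d = [sum(a * b[j] for a, b in zip(alphas, blocks)) % Q
             for j in range(len(blocks[0]))]
        if state["calls"] % lie_every == 0:
            d[0] = (d[0] + 1) % Q          # a wrong response
        return d
    return respond

def learn_equation(respond, alphas, tries=5):
    """Query with multiples c * alphas, rescale by 1/c, and take the majority."""
    votes = Counter()
    for c in range(1, tries + 1):
        d = respond([(c * a) % Q for a in alphas])
        c_inv = pow(c, -1, Q)
        votes[tuple((c_inv * v) % Q for v in d)] += 1
    return list(votes.most_common(1)[0][0])

def solve_mod(A, D):
    """Solve A X = D over F_Q by Gauss-Jordan elimination (A is square)."""
    size = len(A)
    rows = [list(a) + list(d) for a, d in zip(A, D)]
    for col in range(size):
        pivot = next((r for r in range(col, size) if rows[r][col] % Q), None)
        if pivot is None:
            raise ValueError("challenge vectors are linearly dependent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, Q)
        rows[col] = [(v * inv) % Q for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % Q for x, y in zip(rows[r], rows[col])]
    return [row[size:] for row in rows]

def extract(respond, num_blocks):
    # Vandermonde rows (1, x, x^2, ...) for distinct x are linearly independent.
    challenges = [[pow(x, j, Q) for j in range(num_blocks)]
                  for x in range(1, num_blocks + 1)]
    answers = [learn_equation(respond, a) for a in challenges]
    return solve_mod(challenges, answers)
```

With `node = make_flaky_node([[5, 6, 7], [8, 9, 10], [11, 12, 13]])`, the call `extract(node, 3)` recovers the three stored blocks even though a third of the individual responses are wrong.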

6.2 Privacy-Preserving Guarantee 

We summarize the privacy guarantee of NC-Audit in the 
following theorem. 



Theorem 7. From the response of the storage node, the TPA
does not learn any information about the outsourced data,
except for the information that could be derived from the MAC
tag.

Proof. The claim is a direct consequence of Theorem
2 and the fact that the padding element is chosen
randomly. □

We stress that the information derived from the MAC
tags is not sufficient to derive the outsourced data.
To be concrete, each tag is a weighted sum of symbols
belonging to the same block. Also, the outsourced data
consists of $m \times n$ field symbols, which could be considered
as unknowns of a system of linear equations, and
the knowledge given by the tags and the MAC key only
gives at most $n$ linearly independent equations.

7 Performance Evaluation

7.1 Client Storage Overhead 

NC-Audit requires the user and the TPA to store the
coding coefficients, which is in $O(mMN)$ space. The
user needs the coefficients to carry out repair and block
update, while the TPA needs the coefficients to carry out
audits. In any case, this overhead is orders of magnitude
less than the outsourced data, which is in $O((n+m)MN)$
space, since $n \gg m$. In fact, in a practical NC storage
cloud, the storage needed to store the coding coefficients
could be kept less than 16 B (i.e., constant storage) while
being able to support arbitrary file size [24]. Table 1
compares client storage overhead of NC-Audit and other
recent schemes [11], [15], [18].



7.2 Bandwidth Overhead 

Integrity Checking. For each audit round, the major
communication cost is the cost of sending the proof of
possession from the storage node to the TPA, which
is dominated by the size of the (encrypted) data block.
Thanks to the homomorphic property of SpaceMac, blocks
in the challenge can be aggregated. We achieve similar
bandwidth overhead compared to prior schemes for
integrity checking of cloud data [5], [11], [15], [18], i.e.,
the proof of possession for multiple blocks contains only
a single block (of size varying from 4 KB [5] to 1.6 MB
[18]).

We note that a coding scheme can be modified to
support small block size by subdividing source blocks.
For instance, to halve the size of a block, each source
block $\mathbf{b}_i$ can be divided into two equal blocks $\mathbf{b}_{i,1}$
and $\mathbf{b}_{i,2}$. The global coefficients of the blocks are then
changed as follows:

$$\mathbf{b}_{i,1} = (\bar{\mathbf{b}}_{i,1}, \underbrace{0, \dots, 0, 1, 0, \dots, 0}_{2m}) \in \mathbb{F}_q^{n/2+2m}, \text{ with the 1 in position } 2i-1,$$

$$\mathbf{b}_{i,2} = (\bar{\mathbf{b}}_{i,2}, \underbrace{0, \dots, 0, 1, 0, \dots, 0}_{2m}) \in \mathbb{F}_q^{n/2+2m}, \text{ with the 1 in position } 2i.$$

The coding scheme is kept the same: the coding
operations performed on $\mathbf{b}_i$ are translated to similar coding
operations done on both $\mathbf{b}_{i,1}$ and $\mathbf{b}_{i,2}$. Note however that
the overhead of the coefficients is doubled in this case.
In general, it increases linearly in the number of source
blocks.

Repairing and Updating. As shown in Section 5, when
using NC-Audit, the user does not need to download any
data block to repair a failed node or update the outsourced
data. In contrast, in [18], the user needs to download an
amount of data equal to the repair bandwidth to set up
integrity metadata for the new coded blocks herself.

Encryption. The amount of additional bandwidth to
support encryption is small. In particular, NCrypt
requires the storage node to send with the encrypted
block, $\bar{\mathbf{c}}$, the random value, $r$, of size $\lambda$ (typically 80
bits [5]), the auxiliary tag $p$, and the random padding
element $e^{(n)}$, which are both of size $\log_2 q$. These are
negligible compared to the block size, $n \log_2 q$ (0.3% for
$q = 2^8$, $n = 4 \times 2^{10}$).
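
The quoted 0.3% figure can be reproduced with a few lines of arithmetic. The function name below is introduced only for this example; the inputs are the values stated in the text ($\lambda = 80$, $q = 2^8$, $n = 4 \times 2^{10}$).

```python
import math

def encryption_overhead(lambda_bits: int, q: int, n: int) -> float:
    """Extra bits sent for encryption, relative to the block size n * log2(q)."""
    symbol_bits = math.log2(q)
    extra_bits = lambda_bits + 2 * symbol_bits   # r, plus p and e^(n)
    return extra_bits / (n * symbol_bits)

ratio = encryption_overhead(80, 2**8, 4 * 2**10)
print(f"{ratio:.2%}")  # 0.29%
```

That is 96 extra bits on top of a 32,768-bit block, which rounds to the 0.3% stated above.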

7.3 Computational Overhead 

We first analyze the cost of each operation in NC-Audit 
by the number of finite field multiplications involved,
which is the dominating cost factor. We then present the 
cost of each operation from our real implementation in 
Java. We omit the cost of computing PRF values that 
do not take as input random seeds since they can be 
precomputed. 

Integrity Checking with Encryption: 

1. Storage Node Overhead: In NC-Audit, the cost to compute
the proof of possession includes the cost to compute
(i) the aggregated response block, $\mathbf{e}$, (ii) the response
tag, $t$, and (iii) the masking vector, $\bar{\mathbf{m}}$, and the auxiliary
element, $p$. The total cost is dominated by the cost to
compute $\mathbf{e}$ and $\bar{\mathbf{m}}$. $\bar{\mathbf{m}}$ can be precomputed in advance
as it is independent of the challenge. Let $C$ be the
average number of blocks specified in a challenge. The
average cost to compute a response per challenge is
$C \times n$ multiplications with precomputation of $\bar{\mathbf{m}}$ and
$(C + n - 1) \times n$ without.

2. TPA Overhead: In NC-Audit, verifying a proof of
possession can be done very efficiently. In particular, the cost
to verify includes the time to (i) compute the coefficients
of the response block and (ii) run the Verify of SpaceMac.
Let $\ell$ be the number of tags used. The total cost is
$C \times m + \ell \times (n + m)$ multiplications.

Repairing and Updating: 

As described in Section 5, repairing a failed node
does not incur any computation cost at the user side.
Updating a block also incurs a very small amount of
computational overhead for the user. In particular, the
dominant cost is due to computation of the tag of the
new block (either to be updated, inserted, or appended),
which entails $n + m$ field multiplications.



Implementation: 

We implement NC-Audit in Java to compare its
performance with recent schemes [11], [15], [18]. For a fair
comparison with [11], [15], we use $q = 2^8$ and $\ell = 10$
to provide 80-bit security, and we also set the block size to
4 KB ($n = 4 \times 2^{10}$), $m = 500$, and the number of blocks
indicated by a challenge to $C = 300$. We stress that the
choice of parameters may be different in a practical NC
storage system, e.g., in [24], a block size could be as big
as 4 MB while the storage space taken by the coefficients
could be kept below 16 B.
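
For a sense of scale, the multiplication counts from the cost analysis can be evaluated at these parameters. The short Python sketch below only plugs the stated values into the stated formulas; the function names are introduced for this example.

```python
def node_cost(C: int, n: int, precomputed_mask: bool = True) -> int:
    """Field multiplications per response at the storage node."""
    return C * n if precomputed_mask else (C + n - 1) * n

def tpa_cost(C: int, n: int, m: int, ell: int) -> int:
    """Field multiplications per verification at the TPA."""
    return C * m + ell * (n + m)

n, m, C, ell = 4 * 2**10, 500, 300, 10
print(node_cost(C, n))                          # 1228800
print(node_cost(C, n, precomputed_mask=False))  # 18001920
print(tpa_cost(C, n, m, ell))                   # 195960
```

Precomputing the mask therefore removes roughly 93% of the storage node's multiplications at these settings.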

We implement finite field multiplications in $\mathbb{F}_{2^8}$ by
table look-ups and additions using XORs. We also
precompute values that do not depend on the challenges.
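
A common way to realize table-based multiplication in $\mathbb{F}_{2^8}$ is with logarithm and antilogarithm tables. The Python sketch below shows the idea; it is not the authors' Java code, and the reduction polynomial `0x11D` (with generator 2) is an assumption, since the paper does not state which polynomial was used.

```python
PRIM = 0x11D  # assumption: x^8 + x^4 + x^3 + x^2 + 1; the paper does not specify

# Build antilog (EXP) and log (LOG) tables for the generator 2.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM
for i in range(255, 512):
    EXP[i] = EXP[i - 255]       # duplicate so LOG[a] + LOG[b] never needs a modulo

def gf_add(a: int, b: int) -> int:
    """Addition in F_{2^8} is bitwise XOR."""
    return a ^ b

def gf_mul(a: int, b: int) -> int:
    """Multiplication by two table look-ups and one integer addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_mul_slow(a: int, b: int) -> int:
    """Reference shift-and-XOR multiplication, used only to cross-check the tables."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return result
```

The table version replaces the bit-by-bit loop with constant-time look-ups, which is why multiplication count is a reasonable proxy for running time in the analysis above.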

Table 1 compares both the bandwidth overhead and
computational overhead of different remote data
integrity checking schemes. The reported numbers for
[15] and [11] are taken from [11]. (The overhead of the
scheme in [15] is similar to the public-key based scheme
in [9].) We refer the reader to [11] for the detailed setup.
We implement the checking scheme in [18] ourselves.
For this scheme, we use AES with CBC mode from the Java
crypto library to decrypt coefficients. We refer the reader
to Appendix A in [18] for the detailed description of
this scheme. The numbers reported for NC-Audit and the
scheme in [18] are the average of 100 runs on a computer
with a 2.8 GHz CPU and 32 GB RAM.

Table 1 shows that NC-Audit achieves top
bandwidth efficiency while, at the same time, having
very small computational overhead. The computational
overhead of NC-Audit is orders of magnitude smaller
than those of [15] and [11]. This is due to the fact that
NC-Audit is symmetric-key based while the schemes in
[15] and [11] are public-key based and make heavy
use of expensive bilinear mapping operations. We also
note that the scheme in [18] achieves similar storage
node computational overhead as it is also symmetric-key
based; however, due to the cost of executing $C \times m =
150{,}000$ decryptions for the coefficients, the
computational overhead of the TPA is much larger, on
the order of seconds.

8 Conclusion 

In this work, we propose NC-Audit, a remote data in- 
tegrity checking scheme for NC-based storage systems. 
NC-Audit is built on a homomorphic MAC scheme
custom made for network coding, SpaceMac, and a novel
CPA-secure encryption scheme, NCrypt. NC-Audit allows
for efficient integrity checking, supports repair of failed
nodes and data dynamics (including block update,
append, insert, and delete), and prevents leakage of the
outsourced data when the audit is done by a third party.